Density Estimation with Mercer Kernels 


William G. Macready 

Research Institute for Advanced Computer Science 
NASA Ames Research Center 
MailStop 269-4 
Moffett Field, CA 94035 
wgm0 email .arc. nasa . gov 


Abstract 

We present a new method for density estimation based on Mercer ker- 
nels. The density estimate can be understood as the density induced on 
a data manifold by a mixture of Gaussians fit in a feature space. As is 
usual, the feature space and data manifold axe defined with any suitable 
positive-definite kernel function. We modify the standard EM algorithm 
for mixtures of Gaussians to infer the parameters of the density. One 
benefit of the approach is it’s conceptual simplicity, and uniform applica- 
bility over many different types of data. Preliminary results are presented 
for a number of simple problems. 


1 Introduction 

Kernel methods have proven themselves to be an efficient and effective method for a wide 
class of machine learning problems. Kernel methods work by mapping data in some space 
X non-linearly into some feature space F, and applying relatively simple learning methods 
in the feature space. Historically, most kernel methods have been applied to supervised 
learning tasks (e.g. support vector machines [1], Gaussian processes [2]), but recent work 
has expanded their scope into unsupervised problems as well (e.g. kernel PCA [3], support 
estimation [4], etc). 

The prototypical unsupervised learning task is density estimation where we wish to infer a 
probability density px(x\Dx) that accounts for a data set Dx — {xj^ (with Xi € X) 1 . 
To date there has been relatively little use of kernel methods for density estimation. Recent 
exceptions include the use of support vector methods to estimate cumulative distribution 
functions [5]. Kernel ideas have also inspired improved variations of traditional Parzen 
window density estimation algorithms [6]. We describe a new approach to density estima- 
tion which combines the flexibility and modularity of kernel methods with the simplicity of 
EM for Gaussian mixtures in order to infer probability densities over any data space, even 
those having data elements with mixed type (e.g. discrete and continuous). 

In kernel methods we assume a mapping $ : X h- ► F taking each datum. to a point in a dp 
dimensional Euclidean space F. We indicate the mapped data by Dp = {<t>i }i!i where 

Scalar values are indicated by variables in regular font, while vectors and matrices are in bold 
font and indicated by lower- and upper-case letters respectively. 


<j>i = *(xi). If X has dimension dx then 3> maps X into a dx -dimensional manifold 
embedded in F (assuming dp > dx)\ we call this the data manifold. If the inference 
algorithm used in the feature space uses only inner products, then the only knowledge 
required of <3> is contained in the kernel function K(x, x ') = ($(x), $(x')). 2 An important 
advantage of kernel methods is that even though the inference algorithm at work in F 
operates on vectors in R dF , the method applies to any type of data X so long as we can 
identify a <£ mapping the data to F. By now there exist kernels for mapping many types of 
data, e.g. graphs, trees, symbol sequences, etc., and thus the method proposed here may be 
used to infer probability densities over all these types of data [7]. 

If the mapping $ is suited to the learning task (i.e. the features defined by $ are relevant), 
then the inference algorithm at work in F can be very simple. For the density estimation 
task considered here, we fit a mixture of Gaussian distributions to perform density esti- 
mation in the feature space. This choice offers the benefits of modelling flexibility (with 
enough Gaussians we can approximate any density), and an efficient EM algorithm for 
determining the parameters of the Gaussians. However, as we will show, even a single 
Gaussian p F (<t>) ~ /i, E) with p and E estimated from Dp is often sufficient to model 

complex structure in px- The density px is obtained from pp by simply setting p x to the 
the density induced by pp on the data manifold. This choice does mean the p x will not be 
normalized, but we can sample efficiently frompx (even if it is highly multi-modal) 3 , and 
thus estimate the normalization by Monte Carlo when it is needed. 

The paper is organized as follows. In section 2 we derive an EM algorithm to fit mixture 
of Gaussian densities in feature space by expressing the means and covariances of the 
Gaussians as linear combinations of <j>i. Optimization of an objective expressible in terms 
of kernel evaluations gives an update rule to identify the best linear combination. Section 
3 then considers how the Gaussian density in F is mapped to a density in X, and section 4 
demonstrates some results on simple problems. We conclude in section 4 with a discussion 
of work in progress and a few open problems. 


2 Gaussian Mixture Density Estimation in Feature Space 

With M mixture components, the density model in feature space has the formp F (<t>\d) = 

Em=i PmG(<t> \6 m ) where G{+ \6 m ) = |2rr exp(-(^ - n m ) T ^(4> ~ /0/2). 
The parameters of the rath mixture are 0 m = (p m) p£ mj £ m ), and we group all param- 
eters into the vector 6 = (#i , • ■ • »0 m)- The mixture probabilities must sum to 1, i.e. 

Y^m = l Pm = 1* The EM algorithm is a convenient method to determine the parame- 
ters 0 of this mixture model. It is an iterative method in which an existing guess for the 
parameters (call this 0 9 ) is updated by maximizing the average log posterior of the data 
Dp. The averaging is done over N hidden variables, z i} which indicate which mixture 
was responsible for each observation. If p(z\Dp,d 9 ) is the the cunrent estimate for the 
probability of the hidden variables given a guess for the mixture parameters, then define 
Q(0\6 9 ) = E (in p(Dp } z\$) + lnp(0)) where the expectation is performed with respect to 
p{z\Dp ) S 9 ). For Gaussian mixtures this is calculated as [8] 

• m ( n 'i 

Q(0\0 9 ) = £ | \np(6 m ) + '£Mp m G(<f>i\e m )p(Tn\<f> i ,6 9 ) f 
m— 1 l i—1 J 


2 (<p, <j>') is the usual inner product where <j>(a) is the ath component of <f>. If 

F is infinite dimensional the sum is replaced by an integral. 

3 Work in progress. 



where using Bayes rule 


p(jn\<f>i, 6 s ) 


PjnWM 
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is the probability that 4>i was generated by the mth mixture. Given the existing guess B 9 
a better guess is obtained by maximizing Q{B\$ 9 ) with respect to 6. In this formulation 
we have allowed for prior probabilities, p(B m ), on the mixture parameters. This inclusion 
is essential in order to guarantee that we obtain positive-definite covariance estimates. For 
many kernels, cLf > N so that naive estimation of £ m would result in singular covariances. 
We will not have need of priors over the means and mixture weights, and we employ an 
inverse Wishart distribution for the prior over each £ m [9]. The inverse Wishart distribu- 
tion is given by oc |£“ 1 |^ /2 exp(-atr(£^ l 1 J)/2). The role of Wishart 

parameters can be seen by maximizing the inverse Wishart distribution. The mode occurs 
at E m = ctJ/0. In what follows we will take J = I, the identity in feature space. Up to 
irrelevant constants, log p(£ m ) = O^lnlS” 1 ! - a trE" 1 )/^. Defining 


N N 

< = and S m = ]T(& - p m )(& - Pm) J p{m\<j>i,6 9 ) (2) 

1=1 i=l 

a standard calculation yields 

Q(W 9 ) = £ {< b p m + In IE" 1 1 - 1 + o^m 1 ) j • 

m= 1 ^ ' 


Since the EM algorithm only has access to inner products in F we must express Q in 
terms of the kernel K . To this end we write the mean and covariance as: fi m — Va m 
and £ m = e m I -F VB m V T where V = [<j>i ■ ■ • #n\. The parameters e m , a m and 

B m that we need to determine are respectively a scalar, an TV- vector and a positive definite 
N x N matrix. With a slight abuse of notation we set B m — (^a^Em) which we 
will determine by maximizing Q. We include a multiple of the identity to ensure that 
E m is positive-definite. It is easily verified that the inverse of the covariance is given by 
£ m = e m I + VB m V T if 

e m = l/e m and B m = -€ m B^ 2 (e m I N + B^KB^ 2 )-^ 2 

where Ijv is the IV x N identity matrix, K is the symmetric positive definite N x N Gram 
matrix given by K = V T V = [KiJ\ with Ki tj — K(xi,x.j), and B^ 2 is the Cholesky 
decomposition of B m . For future reference we also note the identities 


^ rnBrn ~F ^mB m + B m KB m = € m B m -f e m B m + B m KB m — 0. 


From these equations we may derive 

e m K- 1 +B ra = (e m K+KB m K)' 1 and e m K -1 +B m = (e m K+KB m K) -1 (3) 
which generalizes from inverses to pseudoinverses. 

With the assumed representations for /x m and £ m , the argument of the exponential in the 
mth Gaussian when evaluated at <j> x = $(x) is (<j> x ~ p m ) T E" 1 ^ - p m ) = (cf> x - 
Va. m ) T (e rri I + VB m V r )(0 x - Va m ) which is equal to 

= Qt-x FCa m ) B^k^ — Ka^) -J- im (F~ XiX — 2k I a m -j- a m K!a m ) 

= Zm(K XtX - kjK^k*) + (k* - Ka m ) T (B m + - Ka m ) 


where we have defined K X]X = if(x.x), and k* = = [iT(x, xi) ••• if(x,x^)]. 

If x is the ith data point then k^K -1 ^ = K^i = K(x i} Xi) so that using Eq. (3) the 
argument of the exponential can be written as 

(<Pi -/*m) T £m(* -A*m) = (k* - Ka m ) T (e m K + KB m K)- 1 (k i - Ka™). (4) 


Returning to the expression of Q in terms of inner products, we can show that (E^ 1 ! = 
e£f “^IKHe^K^ 1 + B m |. Thus In (E" 1 ) which contributes to Q(0 \8 9 ) is 

In IS" 1 1 = (d F - N) In e m + In |K| + In |€ m K“ 1 + B m | 

Further, exploiting Eq. (3) in Eq. (1) we see that 


p(ml<j>i,d 9 ) 


rtXnK + KB m K[~^ exp (-$(& - - Pm)) 

P 9 m\e m K + KB m K|-i/2 exp(-|(fc - /l m ) T Sm {4>i - Pm)) 


with the arguments of the exponentials given by Eq. (4). Thus, the posterior probabilities 
p(m\(f>i, 0 9 ) can be evaluated in terms of kernel values so that S m defined in Eq. (2) can be 
determined. 

Finally, we consider tr(S” 1 S m + aE^ 1 ) which also appears in Q. From the defining 
equation for E“ x 


trE m x = im tr I + tr(VB m V T ) = e m d F + tr(B m V T V) = e m d F + tr(B m K). 


Similarly, tr(E m 1 S m ) = e m tr S m +tr VB m V T S m = e m trS m +trB m V T S m V. Using 
the definition for S m , and the fact that y. m = Va m we find 

N M 

trS m = ^p{m\4>ud 9 )\x{4>i ~ Va m )(& - Va m ) T = ^p(m|&,0*)(0i - Va rn ) T (0 i - 

i=l i=l 

N 

= Wi-^Kki - Ka m ) T K- 1 (k < - Ka m ) = tr(K -1 M m ). 

i~X 

where we have defined the N x N matrix M m = \4>i>0 9 )(ki - Ka m )(k* - 

Ka m ) T . The final term, trB m V T S m V, is expressed by noting that V T S m V = 

(</>i - Va m )(^ i - Va rn ) T V = M m . Combining these results we 
find t^E^S m + aE^ 1 ) = tr((e m K” 1 + B m )(M m + aK)) + ai m (d F - N) so that 
Q{6\6 9 ) is equal to (up to constants independent of e m , a m , B m ) 

Q(0\O 9 ) = E {< ln An - + ^2^(lne1r N + Inl^K" 1 +B m |)- 

m— 1 ^ 

itr((? m K- 1 + B m )(M m + aK))} (6) 

Maximizing this with respect to 0 determines the update formulas for e m , a m , and B m . 


2.1 Maximization of Q 

To update p m we maximize Eq. (6) with respect to p m subject to the constraint that 
Jim Pm = 1. This yields 

Pm = ^f = ^Y J P{rn\4>M. 

i=l 


(7) 



Similarly setting variations dQ of Q with respect to variations dM m equal to zero gives 
dQ = tr((« m K -1 + B m )dM m ). 

Thus the optimal a m is determined by = 0. Using the definition of M m we find a„ 
is determined from dbA m = 2 p(m\fa ) O g )(aJ n K - kj )da. rn — 0 which has solution 


Ka m = y 


P(.™ 


k i or a„ 




where is a unit vector in the it h direction. 

Next we maximize Q with respect to B m to find the parameters for the covariance. Rather 
than consider variations in Q with respect to variations in B m we consider equivalently 
variations in T = + B m : 

dQ = ^±^(ln|T|) - 1 tr((dT)(M m + oK)). • 

The variation in the log determinant is given by d(ln|T|) = tr(T -1 (dT)) so that 

dQ = \ tr ( (« + /?)T _1 - M m - oK)) (dT)) . 

Thus T _1 = (M m + olK )/( n g Tl *f 0). Recalling that T = e m K -1 + B m we find 

(fmK -1 + B m ) -1 = € m K + KB m K = + oK) 

rim + p 

where we have utilized identity Eq. (3). Maximizing Q with respect to e m yields = 
a /( n m + ft)* ™s last result for e m gives a simple expression for B m? 


rim + 0 


( e i a m)(6i a m) 


The complete EM updates are thus specified with equations Eqs. (7), (8), (9). The EM 
iterations can be initialized with a R%means algorithm in feature space. 

Singular K: In many cases the above equations may not be directly applicable because K 
is singular (i.e. some fa are linearly dependent, e.g. when N > dp). Thus V is effectively 
a dp xr matrix where r < N is the number of linearly independent <f>i vectors (and the rank 
of K). This means that rather than using the full inverse of K we use it’s pseudoinverse. 
Since Eq. (3) also holds for pseudoinverses, we perform all calculations in an r dimensional 
subspace of F corresponding to the non-singular eigenvectors of K rather than the full N 
dimensional subspace. 

Shifting the Origin in Feature Space: If the {&} are far from the origin in feature space, 
the Gram matrix conveys much less useful information. In some applications, shifting the 
origin in feature space to the center of mass c = 4>i/N can improve performance 

[10]. This is accomplished in the present context by expanding the mean and covariance 
matrix in terms of <pf = fa- c. Because translations preserve inner products, the argument 
of the exponentials is unaltered if c is absorbed into a m . Thus the EM algorithm can be 
applied as before with new definitions for K c and k£. 


3 From Feature Space to Data Space 

Having determined the density p F in feature space we obtain the density in X by evaluating 
p F on the data manifold in F. 4 Explicitly, px(x\Dx) oc pf($(x)| 0). As noted earlier 

4 We might have induced a density on the data manifold as px(fa = J d<j> 6 (pc ~ $~ (</>)) pp (fa) 
where is a surjection from F to the data manifold, but this would result in non-analytic forms. 


Figure 1: (a) Probability density obtained with two Gaussians fit to 100 noisy samples 
(white dots) centered at the origin and in a circle of radius 2. Blue regions are low prob- 
ability and high regions are red. (b) Classification boundary obtained from the posterior 
probability p(m\<t> x )\ red indicates class 1 (white samples) while blue indicates class 2 
(black samples). The pair of Gaussian densities in feature space. The 2 dimensional data 
manifold is shown in the wire mesh with black dots representing the data samples. Four 
isosurfaces of constant probability are shown in varying translucencies of red. 


this density is not normalized. In fact, for kernel functions which are local and defined over 
infinite spaces, i.e. K (x, x') -* 0 as ||x — x'|| — ► oo the density asymptotes to a tiny but 
finite value and cannot be normalized. In spite of this unpleasant property we have found 
good results even for Gaussian kernels. 


4 Experiments 

As a first example we demonstrate the geometry underlying our approach for X = R 2 . 
In order to visualize the results we select the quadratic kernel JT(x, x 7 ) = [x(l)x'(l) + 
x(2)x'(2)] 2 where x(i) is the ith component of x. For this kernel one choice for the map- 
ping to feature space is <£(:r) = [$(1) $(2) $(3)] T = [&(1) 2 y/2x(l)x(2) z(l) 2 ] T for 
which dp = 3. In Figure 1(a) we plot the density obtained using a 2 component Gaussian 
mixture fit to a simple illustrative data set of 100 data points (white dots of Figure 1(a)). 
One Gaussian captures data near the origin while the other captures the halo around the ori- 
gin, see Figure 1(c). Regularization of the density is controlled by the a and ft parameters 
of the inverse Wishart prior over the covariance. In this example a ~ ,3 = i. Larger values 
of these parameters result in more spherical covariance matrices and smoother density es- 
timates. In Figure 1(b) we also plot p(m = 1\4> X ) f° r varying x. High values (red) indicate 
points assigned to the cluster at the origin and low values (blue) indicate points assigned to 




(a) (b) 


Figure 2: (a) Fit obtained with a single Gaussian in feature space to 200 samples from two 
Gaussians and a Laplace distribution, (b) Fit obtained to a sample of 30 bit strings drawn 
from a mixture of 2 equally weighted Bernoulli distributions. The solid black line is the fit 
with shaded circles showing where data points were located, and the true density is shown 
as dashed blue. The spiky nature of the plot is due to representing each bit string by its 
decimal equivalent. 


the halo. The data has been colored white or black according to its true class label. 

Next, we consider 200 samples from a mixture density in R 2 consisting of two Gaussians 
(with means —[2 2] and [2 2], weights 0.3 and 0.3, and identity covariance), and one Lapla- 
cian density (with weight 0.4) at the origin, p^(x) = (C' 1 ! exp(||C _1 x|j 1 )/2 cix with 
covariance C = [l — 0.6;— 0. 61]. A single Gaussian in feature space with a = 15 and 
0=1 and the kernel AT(x, x') = exp(-[[x - x'|| 2 )) results in the estimate shown in 2(a). 
The estimate captures the sharp peak and fat tails of the Laplacian. 

As a final example we apply the method to estimate a density over a discrete space. For 
easy visualization 5 we chose X = B 7 , and used the kernel K(x,x') = p d ( x ' x ) where 
d(x } x') is the Hamming distance between the bit strings x, and x', and -1 < p < 1 is 
a hyperparameter [7]. The results are not terribly sensitive to the value of p (as long as p 
is positive). In Figure 2(b) we plot the fit obtained with two Gaussians to 30 samples (19 
of which are distinct) from a mixture of two equally weighted Bernoulli distributions 6 p 
and 0 were arbitrarily set to 0.6 and 0.5 respectively, and a was set to 2.6 by leave one out 
cross-validation. The fit is very good, even where the estimate is high or low, it captures 
the change in probability as bits are flipped accurately. 


5 Discussion 

We have developed a conceptually simple density estimation procedure which works by fit- 
ting a mixture of Gaussian distributions in feature space, and using the density induced on 
the data manifold. The EM algorithm can be modified to easily determine the parameters 
of the Gaussians. Preliminary results on simple test problems are encouraging. The method 
scales well with the dimensionality of the data space dx, but poorly with the number of 
training examples N. It would be useful to adapt methods from other kernel methods to 

s Higher dimensions dx only affect the scaling of the algorithm through evaluation of the kernel 
function. 

6 The probability of bit i being 1 was [0.9501 0.6068 0.8913 0.4565 0.8214 0.6154 0.9218] and 
[0.2311 0.48600.76210.0185 0.44470.7919 0.7382] for the two mixtures. 


choose good subsets of the data to improve the scaling with N [11]. The other important 
improvement that should be made concerns the determination of hyperparameters. Cur- 
rently, this is difficult because we do not have access to the normalization of the density 
which depends on the hyperparameters. For general applicability, a method to overcome 
this difficulty to automatically identify hyperparameters is desirable. 

Though we have not outlined the details here, it is straightforward to modify the algorithm 
to account for data spaces having mixed types, e.g. discrete and continuous elements. This 
generalization will be reported elsewhere. The Gaussian description in feature space brings 
with it significant advantages. Firstly, because Gaussians are simple to sample from, we 
may be able use this to sample efficiently ffompx(x). We are currently fleshing this idea 
out. Secondly, classification/regression can be done by fitting a single Gaussian to the joint 
space x and y, and determining Pf v \f x This induces a (typically non-Gaussian) 

density py\x (y |x) over the data manifold. 
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